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Summary: Genomic instability, the propensity of aberrations in chromosomes, plays a critical role 
in the development of many diseases. High throughput genotyping experiments have been performed 
to study genomic instability in diseases. The output of such experiments can be summarized as high 
dimensional binary vectors, where each binary variable records aberration status at one marker 
locus. It is of keen interest to understand how these aberrations interact with each other. In this 
paper, we propose a novel method, LogitNet, to infer the interactions among aberration events. The 
method is based on penalized logistic regression with an extension to account for spatial correlation 
in the genomic instability data. We conduct extensive simulation studies and show that the proposed 
method performs well in the situations considered. Finally, we illustrate the method using genomic 
instability data from breast cancer samples. 
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1. Introduction 

Genomic instability refers to the propensity of aberrations in chromosomes such as mutations, 
deletions and amplifications. It has been thought to play a critical role in the development 
of many diseases, for example, many types of cancers (Klein and Klein 1985). Identifying 
which aberrations contribute to disease risk, and how they may interact with each other 
during disease development is of keen interest. High throughput geno typing experiments 
have been performed to interrogate these aberrations in diseases, providing a vast amount 
of information on genomic instabilities on tens of thousands of marker loci simultaneously. 
These data can essentially be organized as a n x p matrix where n is the number of samples, 
p is the number of marker loci, and the {i,jy^ element of the matrix is the binary aberration 
status for the ith sample at the jth locus. We refer to the interactions among aberrations as 
oncogenic pathways. Our goal is to infer oncogenic pathways based on these binary genomic 
instability profiles. 

Oncogenic pathways can be compactly represented by graphs, in which vertices represent 
p aberrations and edges represent interactions between aberrations. Tools developed for 
graphical models (Lauritzen 1996) can therefore be employed to infer interactions among p 
aberrations. Specifically, each vertex represents a binary random variable that codes aberra- 
tion status at a locus, and an edge will be drawn between two vertices if the corresponding 
two random variables are conditionally dependent given all other random variables. Here, 
we want to point out that graphical models based on conditional dependencies provide 
information on "higher order" interactions compared to other methods (e.g., hierarchical 
clustering) which examine the marginal pairwise correlations. The latter does not tell, for 
example, whether a non-zero correlation is due to a direct interaction between two aberration 
events or due to an indirect interaction through a third intermediate aberration event. 

There is a rich literature on fitting graphical models for a limited number of variables (see 
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for example Dawid and Lauritzen 1993; Whittaker 1990; Edward 2000; Drton and Perlman 
2004, and references therein). However, in genomic instability profiles, the number of genes 
p is typically much larger than the number of samples n. Under such high-dimension-low- 
sample-size scenarios, sparse regularization becomes indispensable for purposes of both model 
tractability and model interpretation. Some work has already been done to tackle this chal- 
lenge for high dimensional continuous variables. For example, Meinshausen and Buhlmann 
(2006) proposed performing neighborhood selection with lasso regression (Tibshirani 1996) 
for each node. Peng et al. (2009a) extended the approach by imposing the sparsity on the 
whole network instead of each neighborhood, and implemented a fast computing algorithm. 
In addition, a penalized maximum likelihood approach has been carefully studied by Yuan 
and Lin (2007), Friedman et al. (2007b) and Rothman et al.(2008), where the p variables were 
assumed to follow a multivariate normal distribution. Besides these cited works, various other 
regularization methods have also been developed for high dimensional continuous variables 
(see for example, Li and Gui 2006 and Schafer and Strimmer 2007). Bayesian approaches 
have been proposed for graphical models as well (see for example, Madigan et al. 1995). 

In this paper, we consider binary variables and propose a novel method, LogitNet, for 
inferring edges, i.e., the conditional dependence between pairs of aberration events given all 
others. Assuming a tree topology for oncogenic pathways, we derive the joint probability 
distribution of the p binary variables, which naturally leads to a set of p logistic regression 
models with the combined pxp coefficient matrix being symmetric. We propose sparse logistic 
regression with a lasso penalty term and extend it to account for the spatial correlation along 
the genome. This extension together with the enforcement of symmetry of the coefficient 
matrix produces a group selection effect, which enables LogitNet to make good use of spatial 
correlation when inferring the edges. 

LogitNet is related to the work by Ravikumar et al. (2009), which also utilized sparse 
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logistic regression to construct a network based on high dimensional binary variables. The 
basic idea of Ravikumar et al. is the same as that of Meinshausen and Buhlmann's (2006) 
neighborhood selection approach, in which sparse logistic regression was performed for each 
binary variable given all others. Sparsity constraint was then imposed on each neighbor- 
hood and the sparse regression was performed for each binary variable separately. Thus, 
the symmetry of conditional dependence obtained from regressing variable Xr on variable 
Xg and from regressing Xs on X^ is not guaranteed. As such, it can yield contradictory 
neighborhoods, which makes interpretation of the results difficult. It also loses power in 
detecting dependencies, especially when the sample size is small. The proposed LogitNet, 
on the other hand, makes use of the symmetry, which produces compatible logistic regression 
models for all variables and has thus achieved a more coherent result with better efficiency 
than the Ravikumar et al. approach. We show by intensive simulation studies that LogitNet 
performs better in terms of false positive rate and false negative rate of edge detection. 

The rest of the paper is organized as follows. In section 2, we will present the model, 
its implementation and the selection of the penalty parameter. Simulation studies of the 
proposed method and the comparison with the Ravikumar et al. approach are described 
in Section 3. Real genomic instability data from breast cancer samples is used to illustrate 
the method and the results are described in Section 4. Finally, we conclude the paper with 
remarks on future work. 

2. Methods 

2.1 LogitNet Model and Likelihood Function 

Consider a p x 1 vector of binary variables X^ = {Xi , . . . , Xp) for which we are interested in 
inferring the conditional dependencies. Here the superscript T is a transpose. The pattern of 
conditional dependencies between these binary variables can be described by an undirected 
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graph Q = {V,E), where ^ is a finite set of vertices, that are associated with 

binary variables {Xi, . . . , Xp); and E is a set of pairs of vertices such that each pair in E are 
conditionally dependent given the rest of binary variables. We assume that the edge set E 
doesn't contain cycles, i.e., no path begins and ends with the same vertex. For example, in 
a set of four vertices, if the edge set includes (1,2), (2,3), and (3,4), it can't include the edge 
(1,4) or (1,3) or (2,4), as it will form a cycle. Under this assumption, the joint probability 
distribution Pr(X) can be represented as a product of functions of pairs of binary variables. 
We formalize this result in the following proposition: 

Proposition 1. Let V = {1, . . . ,p} and X_(^r,s) denote the vector of binary variables X 
excluding Xj. and X^ for r,s E V . Define the edge set 

E = {{r,s)\ Pr(X,,,X,|X_(,,,,)) ^ Pr(X,|X_(,,,)) Pr(X,|X_(,,,,)); r, s G V,r < s}, 

and \E\ = K. If Q doesn't contain cycles, then there exist functions {hk, k = 1, . . . , K} such 
that 

K 

PT{X) = l[h,{Xr„X,,), 

k=l 

where (r^, s^) G E for k = 1, . . . , K . 

The proof of Proposition 1 is largely based on the Hammersley and Clifford theorem 
(Lauritzen, 1996) and given in Supplementary Appendix A. 

Assuming Pt{X) is strictly positive for all values of X, then the above probability distri- 
bution leads to the well known quadratic exponential model 

Fr{X = x)=A-^exp{x^e + z'^K), (1) 

where z"^ = (xiX2, X1X3, . . . , Xp-iXp), 9'^ = {9i,...,9p), = (ki2, K13, • • • , K.{p-i)p), and A is 
a normalization constant such that A = Xlxi=o ' ' ' =0 exp(x"^^ + k). 
Under this model, the zero values in /t are equivalent to the conditional independence for 
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the corresponding binary variables. The following proposition describes this result and the 
proof is given in Supplementary Appendix B. 

Proposition 2. If the distribution on X is (QP, then Xr AL Xs\ X_(^r,s) if and only if 
Krs = 0, for 1 < r < s < p. 

As the goal of graphical model selection is to infer the edge set E which represents the 
conditional dependence among all the variables, the result of Proposition 2 implies that we 
can infer the edge between a pair of events, say X^ and Xs, based on whether or not Urs is 
equal to 0. Interestingly, under model ([1]), k can also be interpreted as a conditional odds 
ratio. This can be seen from 





= 1 3^1, • 


• ; -^S—ly -^S+l ) • • 






= 0|xi, . 


• ; -^S— 1 ; -^S+l ) • • 





Taking the log transformation of the left hand side of this equation results the familiar form 
of a logistic regression model, where the outcome is the jth binary variable and the predictors 
are all the other binary variables. Doing this for each of Xi,X2, ■ ■ ■ ,Xp, we obtain p logistic 
regressions models: 

logit{Pr(a;i = l|x2, . . . ,a;p)} = 6i + K12X2 + ■ ■ ■ + KipXp, 

; (2) 

logit{Pr(xp = . . . ,a;p_i)} = KipXi + . . . + K^p_i)pXp_i + 9p. 
The matrix of all of the regression coefficients from p logistic regression models can then be 
row combined as 

/ f> \ 

fi 1^12 ■ ■ ■ f^ip 



B 



\ i^ip 



l^2p 



J 



with matrix elements defined by Prs for the rth row and the sth column of B. It is easy to see 
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that the B matrix is symmetric, i.e., Prs = Psr = K,rs, i ^ j under model ([2]). Vice versa, the 
symmetry of B ensures the compatibihty of the p logistic conditional distributions ([2]), and 
the resulting joint distribution is the quadratic exponential model ([I]) (Joe and Liu, 1986). 
Thus, to infer the edge set E of the graphical model, i.e., non-zero off-diagonal entries in i3, 
we can resort to regression analysis by simultaneously fitting the p logistic regression models 
in ([2]) with symmetric B. 

Specifically, let Xnxp denote the data which consists of n samples each measured with 
p-variate binary events. We also define two other variables mainly for the ease of the 
presentation of the likelihood function: (1) Y is the same as X but with Os replaced with 
-Is; (2) X^,r = 1, . . . ,p same as X but with rth column set to 1. We propose to maximize 
the joint log likelihood of the p logistic regressions in ([2]) as follows: 

p n 

^(^) = - E E { 1 + ^M-X' [hmr^^- y^r) } . (3) 
r=l i=l 

where = (a:[]^, . . . , x^^); and B[r,] = {Pri, Prp)- Note, here we have the constraints 

Prs = Psr for 1 < r < s < p; and now represents the intercept 6*^ of the rth regression. 

Recall that our interest is to infer oncogenic pathways based on genome instability profiles 
of tumor samples. Most often, we are dealing with hundreds or thousands of genes and 
only tens or hundreds of samples. Thus, regularization on parameter estimation becomes 
indispensable as the number of variables is larger than the sample size, p > n. In the past 
decade, ii norm based sparsity constraints such as lasso (Tibshirani 1996) have shown 
considerable success in handling high-dimension-low-sample-size problems when the true 
model is sparse relative to the dimensionality of the data. Since it is widely believed that 
genetic regulatory relationships are intrinsically sparse (Jeong et al. 2001; Gardner et al. 
2003), we propose to use ii norm penalty for inferring oncogenic pathways. The penalized 
loss function can be written as: 
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P P 



(4) 



r=l s=r+l 



Note that £i-norm penalty is imposed on all off-diagonal entries of B matrix simultaneously 
to control the overall sparsity of the joint logistical regression model, i.e., only a limited 
number of (3rs, r ^ s will be non-zero. We then estimate B by B{X) := arg ming/^"*''"(jB). In 
the rest of the paper, we refer to the model defined in (jlj) as LogitNet model, B{X) as the 
LogitNet estimator and (3rs{X) as the rsth element of B{X). 

As described in the Introduction, the LogitNet model is closely related to the work by 
Ravikumar et al. (2009) which fits p lasso logistic regressions separately (hereafter referred 
to as SepLogit). Our model, however, differs in two aspects: (1) LogitNet imposes the lasso 
constraint for the entire network while SepLogit does it for each neighborhood; (2) LogitNet 
enforces symmetry when estimating the regression coefficients while SepLogit doesn't, so 
for LogitNet there are only about half of the parameters needed to be estimated as for 
SepLogit. As a result, the LogitNet estimates are more efficient and the results are more 
interpretable than SepLogit. 

2.2 Model fitting 

In this section, we describe an algorithm for obtaining the LogitNet estimator B{\). The 
algorithm extends the gradient descent algorithm (Genkin et al. 2007) to enforce the symme- 
try of B. Parameters are updated one at a time using a one-step Newton-Raphson algorithm, 
in the same spirit as the shooting algorithm (Fu, 1998) and the coordinate descent algorithm 
(Friedman et al., 2007a) for solving the general linear lasso regressions. 

More specifically, let /(/?rs) and l{l3rs) be the first- and second- partial derivatives of log- 
likehhood 1{B) with respect to /^^s, 




n 



n 
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where Rr = X^[i,]f]'^[r,]Y[i,r]. Under the Newton-Raphson algorithm, the update for the 
estimate (3rs is A(3rs = —i{f^rs)/'l{Prs)- For the penahzed hkehhood (jlj), the update for (3rs{X) 
is 

jlasso (a \ 
\nlasso _ ''A KPrsJ 

= A^,, - y-^sgn(/?„), (5) 

''{Prs) 

where sgn{(5rs) is a sign function, which is 1 if (3rs is positive and -1 if Prs is negative. The 
estimates are also thresholded such that if an update overshoots and crosses the zero, the 
update will be set to 0. If the current estimate is 0, the algorithm will try both directions 
by setting sgn to be 1 and -1, respectively. By the convexity of (jlj), the update for both 
directions can not be simultaneously successful. If it fails on both directions, the estimate 
will be set to 0. The algorithm also takes other steps to make sure the estimates and the 
numerical procedure are stable, including limiting the update sizes and setting the upper 
bounds for / (Zhang and Oles 2001). See Supplemental Appendix C for more details of the 
algorithm. 

To further improve the convergence speed of the algorithm, we utilize the Active-Shooting 
idea proposed by Peng et al. (2009a) and Friedman et al. (2009). Specifically, at each iteration, 
we define the set of currently non-zero coefficients as the current active set and conduct the 
following two steps: (1) update the coefficients in the active set until convergence is achieved; 
(2) conduct a full loop update on all the coefficients one by one. We then repeat (1) and (2) 
until convergence is achieved on all of the coefficients. Since the target model in our problem 
is usually very sparse, this algorithm achieves a very fast convergence rate by focusing on 
the small subspace whose members are more likely to be in the model. 

We note that in equation (5) the regularization shrinks the estimate towards zero by the 
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amount determined by the penahy parameter A and that each parameter is not penahzed by 
the same amount: A is weighted by the variance l{(3rs) of (3rs- In other words, parameter 
estimates that have larger variances will be penalized more than the ones that have smaller 
variances. It turns out that this type of penalization is very useful, as it would also offer us 
ways to account for the other features of the data. In the next section we show a proposal 
for adding another weight function to account for spatial correlations in genomic instability 
data. 

2.3 Spatial correlation 

Spatial correlation of aberrations is common in genomic instability data. When we perform 
the regression of X^. on all other binary variables, loci that are spatially closest to the Xr are 
likely the strongest predictors in the model and will explain away most of the variation in Xj.. 
The loci at the other locations of the same or other chromosomes, even if they are correlated 
with Xr, may be left out in the model. Obviously this result is not desirable because our 
objective is to identify the network among all of these loci (binary variables), in particular 
those that are not close spatially as we know them already. 

One approach to accounting for this undesirable spatial effect is to downweight the effect 
of the neighboring loci of Xr when regressing Xr on the rest of the loci. Recall that in 
Section 2.2, we observed that the penalty term in ([5]) is inversely weighted by the variance 
of the parameter estimates. Following the same idea, we can achieve the downweighting 
of neighboring loci by letting the penalty term be proportional to the strength of their 
correlations with Xr- This way we can shrink the effects of the neighboring loci with strong 
spatial correlation more than those that have less or no spatial correlation. Specifically, the 
update for the parameter estimate jSrs in can be written as 

111 

APlT° = APrs-\j-f-SgniPrs), 

^ \Prs ) 

where Wrs is the weight for the spatial correlation. Naturally the weight Wrs for Xr and Xg 
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on different chromosomes is 1 and for Xj. and Xg on the same chromosome should depend on 
the strength of the spatial correlation. As the spatial correlation varies across the genome, 
we propose the following adaptive estimator for Wrs- 

(1) Calculate the odds ratio a between every locus in the chromosome with the target locus 
by univariate logistic regression. 

(2) Plot the a's by their genomic locations and smooth the profile by loess with a window 
size of 10 loci. 

(3) Set the smoothed curve 5 to as soon as the curve starting from the target locus hits 
0. Here "hits 0" is defined as 5 < e, where e = median^lSr — 5r+i|. 

(4) Set the weight w = exp(a). 

It is worth noting that the above weighting scheme together with the enforcement of 
the symmetry of B in LogitNet encourages a group selection effect, i.e., highly correlated 
predictors tend to be in or out of the model simultaneously. We illustrate this point with a 
simple example system of three variables Xi,X2, and X^. Suppose that X2 and X3 are very 
close on the genome and highly correlated; and Xi is associated with X2 and X3 but sits on 
a different chromosome. Under our proposal, the weight matrix w is 1 for all entries except 
■"^23 = "^^32 = ci, which is a large value because of the strong spatial correlation between X2 
and X3. Then, for LogitNet, the joint logistic regression model 

logit (Xi ) ~ /3n + P12X2 + P13X3 , (6) 

l0git(X2) ~ /3i2Xi + (322 + (7) 

logit(X3) ~ AaXi + (323X2 + (333, (8) 

is subject to the constraint IPul + IPnl + a\P23\ < s. Because of the large value of a, P23 will 
likely be shrunk to zero, which ensures (3i2 and (3i3 to be nonzero in ([7]) and ([H]), respectively. 
With the symmetry constraint imposed on B matrix, we also enforce both (3i2 and (3i3 to be 
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selected in (jH]). This grouping effect would not happen if we fit only the model (El) for which 
only one of /5i2 and /5i3 would likely be selected (Zou and Hastie 2005), nor would it happen 
if we didn't have a large value of a because (323 would have been the dominant coefficient in 
models (I7j) and (IHl). Indeed, the group selection effect of LogitNet is clearly observed in the 
simulation studies conducted in Section 3. 

2.4 Penalty Parameter Selection 

We consider two procedures for selecting the penalty parameter A: cross validation (CV) and 
Bayesian Information Criterion (BIG). 

2.4.1 Cross Validation. After we derive the weight matrix w based on the whole data set, 
we divide the data into V non-overlapping equal subsets. Treat the v^^ subset X^""^ as the 
f*^ test set, and its complement X'^""^ as the f*'^ training set. For a given A, we first obtain 
the LogitNet estimate B^{X) with the weight matrix w on the f*'* training set X~^^\ Since 
in our problem the true model is usually very sparse, the degree of regularization needed 
is often high. As a result, the value of B'"{X) could be shrunk far from the true parameter 
values. Using such heavily shrunk estimates for choosing A from cross validation often results 
in severe over- fitting (Efron et al. 2004). Thus, we re-estimate B using the selected model 
in the vth training set without any shrinkage and use it in calculating the log-likelihood for 
the v^^ test set. The un-shrunk estimates Sl^^A) can be easily obtained from our current 
algorithm for the regularized estimates with modifications described below: 

(1) Define a new weight matrix such that = 1, if Prt^ 7^ 0; and w^^ = M, if pit^ = 0, 
where M = max{wrs}- 

(2) Fit the LogitNet model using the new weight matrix w, thus {Prs\w^s = 1} are not 
penalized in the model and all other j3rs are shrunk to 0. The result is ;Bins(A). 
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We then calculate the joint log likelihood of logistic regressions using the un-shrunk es- 
timates on the v^^ test set l{&^ns{\)\X^'"'>) according to formula ([2]). The optimal Acv = 
argmax,Er=i^(^-UA)|X(^)). 

In order to further control the false positive findings due to stochastic variation, we employ 
the cv. vote procedure proposed by Peng et al. (2009b). The idea is to derive the "consensus" 
result of the models estimated from each training set, as variables that are consistently 
selected by different training sets should be more likely to appear in the true model than 
the ones that are selected by one or few training sets. Specifically, for a pair of rth and sth 
variables, we define 

[ 1, if Er=i^(3inU(Acv)^0)>V/2; 

Srs(Acv) = < (9) 

I 0, otherwise. 
We return {sr.<i(Acv)} as our final result. 

2.4.2 BIC. We can also use BIG to select A: 

Abic = argmm | -2/(^.„.(A)|X) + log(n) J] J(&),,„(A) ^ 0) I (10) 

I r<s ) 

where Y.r<sKMsU>^) 7^ 0) gives the dimension of the parameter space of the selected 
model. Here again, un-shrunk estimates Bu2s{^) is used to calculate the log likelihood. 

3. Simulation Studies 

In this section, we investigate the performance of the LogitNet method and compare it with 
SepLogit which fits p separate lasso logistic regressions all using the same penalty parameter 
value (Ravikumar et al., 2009). We use R package glmnet to compute the SepLogit solution 
and the same weight matrix as described in Section 2.3 to account for the spatial correlation. 
In addition, since the SepLogit method does not ensure the symmetry of the estimated B 
matrix, there will be cases that (3^3 = but (3sr 7^ or vice versa. In these cases we interpret 
the result using the "OR" rule: Xj. and Xg are deemed to be conditionally dependent if 
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either Prs or is 0. We have also used the "AND" rule, i.e. Xr and Xg are deemed to be 
conditionally dependent if both j3rs and jSgr are 0. The "AND" rule always yields very high 
false negative rate. Due to space limitations, we omit the results for the "AND" rule. 

3.1 Simulation setting 

We generated background aberration events with spatial correlation using a homogenous 
Bernoulli Markov model. It is part of the instability-selection model (Newton et al. 1998), 
which hypothesizes that the genetic structure of a progenitor cell is subject to chromosomal 
instability that causes random aberrations. The Markov model has two parameters: 6 and 
u, where 6 is the marginal (stationary) probability at a marker locus and u measures the 
strength of the dependence between the aberrations. So 6 plays the role of a background 
or sporadic aberration when u affects the overall rate of change in the stochastic process. 
Under this model, the infinitesimal rate of change from no aberration to aberration is z/(5, 
and from aberration to no aberration is i^(l — 6). We then super-imposed the aberrations at 
disease loci, which were generated according to a pre-determined oncogenic pathway, on the 
background aberration events. The algorithm for generating an aberration indicator vector 
X'^ = {Xi, ■ ■ ■ , Xp) is given below: 

1. Specify the topology of an oncogenic pathway for the disease loci and the transitional 
probabilities among the aberrations on the pathway. The K disease loci are indexed by 
{si, ■ ■ ■ ,sk}, where Si e {1, . . . ,p} for i = 1, . . . , K. 

2. Generate background aberrations denoted by a p x 1 vector Z according to the homoge- 
nous Bernoulli Markov process with preselected values of 5 = 0.05 and u = 15. 

3. Generate aberration events at disease loci following the oncogenic pathway specified in 
Step 1. This is denoted by a p x 1 vector U, where indices {si, ■ ■ ■ , sk} are disease loci. If 
disease locus Si has an aberration (f/^. = 1), we also assign aberrations to its neighboring 
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loci Ut = 1, for t E [si — ai,Si + bi], where ai and bi are independently sampled from 
Uniform[0, 30]. The rest of the elements in U are 0. 
4. Combine the aberration events at disease loci and the background aberrations by assign- 
ing Xi = 1 if Ui + Zi > and a Ui = Zi = 0, hi i = 1, ... ,p. 

We set n = 200 and p = 600 to mimic the dimension of the real data set used in Section HI 
so V = {1,...,600}. We assume the 600 marker loci fall into 6 different chromosomes 
with 100 loci on each chromosome. We consider two different oncogenic pathway models: 
a chain shape and a tree shape (see Figured]). Each model contains 6 aberration events: 
M = {A, B, C, D, E, F}. Without loss of generality, we assume these 6 aberrations are located in 
the middle of each chromosome, so the indices of A-F are = 50, = 150, ■ ■ ■, Sp = 550, 
respectively. For any u G M, Xg^ = 1 means aberration u occurs in the sample. 

We evaluate the performance of the methods by two metrics: the false positive rate 
(FPR) and the false negative rate (FNR) of edge detection. Denote the true edge set E = 
{{u,v)\Xu and X^ are conditionally dependent , n G V", f G V}. We define a non-zero /3rs a 
false detection if its genome location indices (r, s) is far from the indices of any true edge: 

|r - Jul + |s - /v| > 30, V(u, v) G E. 

For example, in Figure [3] red dots that do not fall into any grey diamond are considered false 
detection. A cutoff value of 30 is used here because in the simulation setup (see Step 3) we 
set the maximum aberration size around the disease locus to be 30. Similarly, we define a 
conditionally dependent pair (u, v) G -E is missed, if there is no non-zero f3 falling in the 
grey diamond. We then calculate FPR as the number of false detections divided by the total 
number of non-zero /3rs, r < s; and calculate FNR as the number of missed (u, v) G -E divided 
by the size of E. 
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3.2 Simulation I — Chain Model 

For the chain model, aberrations A-F occur sequentially on one oncogenic pathway. The aber- 
ration frequencies and transitional probabilities along the oncogenic pathway are illustrated 



in Figure 1(a) The true conditionally dependent pairs in this model are 

E = {(Sa, Sb), (Sb, Sc), {Sc, Sd), (Sd, Se), (Se, Sf)}. 

Based on this chain model, we generated 50 independent data sets. The heatmap of one 
example data matrix 200x600 is shown in Supplemental Figure S-1. We then apply LogitNet 
and SepLogit to each simulated data set for a series of different values of A. Figure [2] shows 
the FPR and FNR of the two methods as a function of A. For both methods, FPR decreases with 
A while FNR increases with A. Comparing the two methods, LogitNet clearly outperforms 
SepLogit in terms of FPR and FNR. For LogitNet, the average optimal total error rate 
(FPR+FNR) across the 50 independent data sets is 0.014 (s.d. =0.029); while the average 
optimal total error rate for SepLogit is 0.211 (s.d. =0.203). Specifically, taking the data set 
shown in the Supplemental Figure S-1 as an example, the optimal total error rate achieved 
by LogitNet on this data set is 0, while the optimal total error achieve by SepLogit is 
0.563 (FPR= 0.563, FNR= 0). The corresponding two coefficient matrices B are illustrated in 
Figure [3l As one can see, there is a large degree of asymmetry in the result of SepLogit: 435 
out of the 476 non-zero Prs have inconsistent transpose elements, Psr = 0. On the contrary, 
by enforcing symmetry our proposed approach LogitNet has correctly identified all five true 
conditionally dependent pairs in the chain model. Moreover, the non-zero /Srs^s plotted by 
red dots tend to be clustered within the grey diamonds. This shows that LogitNet indeed 
encourages group selection for highly correlated predictors, and thus is able to make good 
use of the spatial correlation in the data when inferring the edges. 

We also evaluated the two penalty parameter selection approaches: CV and BIC, for 
LogitNet. Table [1] summarizes the FPR and FNR for CV and BIC. Both approaches performed 
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reasonably well. The CV criterion tends to select larger models than the BIG, and thus has 
more false positives and fewer false negatives. The average total error rate (FPR+FNR) for 
CV is 0.079, which is slightly smaller than the total error rate for BIG, 0.084. 

[Table 1 about here.] 

3.3 Simulation II — Tree Model 

For the tree model, we used the empirical mutagenic tree derived in Beerenwinkel et al. 



(2004) for a HIV data set. The details of the model are illustrated in Figure 1(b) The true 
conditionally dependent pairs in this model are 

E = {(Sa, Sb), (Sb, Se), (Sa, Sc), (Sc, Sf), (Sa, Sd)}. 

The results of LogitNet and SepLogit for these data sets are summarized in Figure HI 
Again, LogitNet outperforms SepLogit in terms of FPR and FNR. The average optimal total 
error rate (FPR+FNR) achieved by LogitNet across the 50 independent data sets is 0.163 
(s.d. =0.106); while the average optimal total error rate for SepLogit is 0.331 (s.d. =0.160), 
twice as large as LogitNet. We also evaluated CV and BIC for LogitNet. The results are 
summarized in Table [H Both CV and BIC give much higher FNRs under the tree model 
than under the chain model. This is not surprising as some transition probabilities between 
aberration events along the pathway are smaller in the tree model than in the chain model. 
As in the chain model, we also observe that BIC gives smaller FPR and higher FNR than CV, 
suggesting CV tends to select larger models and thus yields less false negatives but with 
more false positives in detecting edges. 



4. Application to a Breast Cancer Data Set 

In this section, we illustrate our method using a genomic instability data set from breast 
cancer samples. In this data set the genomic instability is measured by loss of heterozygosity 
(LOH), one of the most common alterations in breast cancer. An LOH event at a marker 
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locus for a tumor is defined as a locus that is homozygous in the tumor and heterozygous in 
the constitutive normal DNA. To gain a better understanding of LOH in breast cancer, Loo 
et al. (2008) conducted a study which used the GeneChip Mapping lOK Assay (Affymetrix, 
Santa Clara, CA) to measure LOH events in 166 breast tumors derived from a population- 
based sample. The array contains 9706 SNPs, with 9670 having annotated genome locations. 
Approximately 25% of the SNPs are heterozygous in normal DNA, which means LOH can not 
be detected in the remaining 75% of SNPs, i.e., the SNPs are non-informative. To minimize 
the missing rate for individual SNPs, we binned the SNPs by cytogenetic bands (cytoband). 
A total of 765 cytobands are covered by these SNPs. For each sample, we define the LOH 
status of a cytoband to be 1 if at least 2 informative SNPs in this cytoband show LOH 
and otherwise. We then remove 164 cytobands which either have missing rates above 20%, 
or show LOH in less than 5 samples to exclude rare events. The average LOH rate in the 
remaining 601 cytobands is 12.3%. 

Despite our effort to minimize missingness in the data, 7.5% of values are still missing 
in the remaining data. We use the multiple imputation approach to impute the missing 
values based on the conditional probability of LOH at the target SNP given the available 
LOH status at adjacent loci. If both adjacent loci are missing LOH status, we will impute the 
genotype using only the marginal probability of the target SNP. See Supplemental Appendix 
D for details of the multiple imputation algorithm. 

We then generate 10 imputed data sets. We apply LogitNet on each of them and use 10- 
fold CV for penalty parameter selection. The total number of edges inferred on each imputed 
data set is summarized in Table [2l We can see that two imputation data sets have far more 
edges detected than the rest of imputation data sets. This suggests that there is a substantial 
variation among imputed data sets and we can not reply on a single imputed data set. Thus, 
we examine the consensus edges across different imputation runs. There are 3 edges inferred in 
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at least 4 imputed datasets (Table[3l). Particularly, interaction between llq24.2 and 13q21.33 
has been consistently detected in all of the 10 imputation data sets. Detailed numbers of 
LOH frequencies at these two cytobands are shown in Supplementary Table S-1. Cytoband 
llq24.2 harbors the CHEKl gene, which is an important gene in the maintenance of genome 
integrity and a potential tumor suppressor. DACHl is located on cytoband 13q21.33 and 
has a role in the inhibition of tumor progression and metastasis in several types of cancer 
(e.g., Wu et al., 2009). Both CHEKl and DACHl inhibit cell cycle progression through 
mechanisms involving the cell cycle inhibitor, CDKNIA. See Supplemental Figure S-2 for 
the pathway showing the interaction between CHEKl on llq24.2 and DACHl on 13q21.33. 

[Table 2 about here.] 

[Table 3 about here.] 

5. Final Remarks 

In this paper, we propose the LogitNet method for learning networks using high dimensional 
binary data. The work is motivated by the interest in inferring disease oncogenic pathways 
from genomic instability profiles (binary data). We show that under the assumption of 
no cycles for the oncogenic pathways, the dependence parameters in the joint probability 
distribution of binary variables can be estimated by fitting a set of logistic regression models 
with a symmetric coefficient matrix. For high-dimension-low-sample-size data, this result is 
especially appealing as we can use sparse regression techniques to regularize the parameter 
estimation. We implemented a fast algorithm for obtaining the LogitNet estimator. This 
algorithm enforces the symmetry of the coefficient matrix and also accounts for the spatial 
correlation in the genomic instability profiles by a weighting scheme. With extensive simu- 
lation studies, we demonstrate that this method achieves good power in edge detection, and 
also performs favorably compared to an existing method. 
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In LogitNet, the weighting scheme together with the enforcement of symmetry encourage 
a group selection effect, i.e., highly spatially correlated variables tend to be in and out of 
the model simultaneously. It is conceivable that this group selection effect may be further 
enhanced by replacing the lasso penalty with the elastic net penalty proposed by Zou 
and Haste (2005) as Ai J2i<r<s<p \(^rs\ + ^2 J2i<r<s<p (^rs- The square £2 norm penalty may 
facilitate group selection within each regularized logistic regression. More investigation along 
this line is warranted. 

R package LogitNet is available from the authors upon request. It will also be made 
available through CRAN shortly. 

[Figure 1 about here.] 
[Figure 2 about here.] 
[Figure 3 about here.] 
[Figure 4 about here.] 
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(a) The Chain Model 
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(b) The Tree Model 

Figure 1. (a) A chain sliape oncogenic pathway; (b) A tree shape oncogenic pathway. 
Numbers in the yellow boxes are the marginal frequencies of the aberration in the disease 
population. The numbers on/blow the arrows are the conditional probabilities between 
mutations along the oncogenic pathway. For example, consider the arrow from Mut A to 
Mut B in panel (a), the "P = 0.6" above the arrow means: Prob(Mutation B happens | 
Mutation A happens)= 0.6; while the "P' = 0.05" below the arrow means: Prob(Mutation 
B happens | Mutation A does not happen)= 0.05. The remark "P' = 0.05 for all arrows" in 
panel (b) suggests: Prob(The right side mutation happens | the left side mutation does not 
happen)= 0.05. 
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Figure 2. Results of LogitNet and SepLogit for the chain model in Figure 1(a) Each 
grey line represents one of the 50 independent data sets. The solid line is the mean curve 
with the two dashed lines represent mean ± one s.d. 
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(a) Estimated B by LogitNet at the A giving the smallest (b) Estimated B by SepLogit at the A giving the small- 
total error rate. est total error rate. 

Figure 3. Results of LogitNet and SepLogit on the example data set shown in Supple- 
mental Figure S-1. Each red dot represents a non-zero Prs- Points in the grey diamond are 
deemed as correct detections. The dashed blue lines indicate the locations of aberration A-F. 
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(b) False negative rate (FNR). 




Figure 4. Results of LogitNet and SepLogit for the tree model in Figure 1(b) Each grey 
line represents one of the 50 independent data sets. The solid line is the mean curve with 
the two dashed lines represent mean ± one s.d. 
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Table 1 

Summary of FPR and FNR for LogitNet for using CV and BIC to choose optimal A. Each entry is the mean (S.D.) 

over 50 independent data sets 

Chain Model Tree Model 

FPR FNR FPR FNR 

CV 0.079 (0.049) (0) 0.058 (0.059) 0.280 (0.17) 
BIC 0.025 (0.035) 0.06 (0.101) 0.024 (0.038) 0.436 (0.197) 
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Table 2 

Number of edges detected in each imputed data set. 



Imputation Index 123 4 56 7 8 910 
# of edges detected 2 2 5 219 3 10 1 114 2 1 
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Table 3 

Annotation for the edges inferred in at least 4 out of 10 imputed datasets. 



Cytoband pair 


Frequency of selection 


llq22.3, 13q33.1 


6 


llq24.2, 13q21.33 


10 


llq25, 13ql4.11 


4 



